HDDS-15059. Shift streaming write sortDatanodes logic to OM#10633
@ivandika3 Is there any risk that a block ends up allocated on a suboptimal datanode/pipeline because the topology cached in OM is somehow stale?
@peterxcli thanks for checking.
It should have some effect on performance but not on correctness, since a streaming write pipeline should be able to use an arbitrary topology. The data path (streaming WriteChunk data), which can be sent to any primary (first) node, is separate from the metadata path (PutBlock), which is sent to the DN leader. The impact of a suboptimal streaming write pipeline topology should be worse write latency (e.g. if the topology picks the furthest node as the primary node). But this possible performance penalty also applies to the read path (i.e. where the furthest node is read first), so I think it should be acceptable. Please let me know if I missed something. @chihsuan Thanks for the patch, I'll review this soon.
@ivandika3 , @chihsuan , Would this change improve the overall performance? OM is supposed to be more CPU demanding than SCM. So, we probably want to save the CPU cycles in OM. What do you think?
@szetszwo Thanks for checking this out. The reason I raised this is to follow up on HDDS-9343, which is meant to have the sort datanodes logic done in OM.
I have not measured the overall performance difference yet, but for a single OM and single OM service, the performance improvement (if any) might not be that much.
That is a good point. However, in our cluster we have multiple OM services that point to a single SCM service, so I think SCM service resources should be protected more, since SCM can become the bottleneck as more OM services point to the same SCM service. Additionally, SCM should ideally spend most of its time on background services (processing heartbeat reports, etc.) and therefore as little time as possible on user foreground processing (i.e. allocating blocks, fetching read pipelines, etc.), which includes sorting datanodes. Please let me know what you think. This is not really a critical issue, so I think we can defer it if you don't see any reason to support it now.
Good to know that you are running with multiple OM services!
How about making it configurable?
Yes, making it configurable is a good idea.
Thanks @szetszwo and @ivandika3 for the discussion. I'll add an OM-side boolean (e.g. Does
@chihsuan Thanks for the follow-up, let's set the default to false to preserve the current behavior. We can remove this configuration in the future if there is a significant performance improvement.
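To make the proposal above concrete, here is a minimal sketch of how such an OM-side switch could behave, assuming a default of false. The key name, the class, and the use of `java.util.Properties` are all hypothetical stand-ins for illustration; the actual key name and configuration API are not settled in this thread.

```java
import java.util.Properties;

/** Illustrative only: this is not Ozone's configuration API. */
final class WriteSortConfig {
  // Hypothetical key name, invented for this sketch.
  static final String KEY = "ozone.om.sort.datanodes.for.write.enabled";
  // Default false keeps the current behavior (SCM sorts the write pipeline).
  static final boolean DEFAULT = false;

  /** Returns true when OM, rather than SCM, should sort the write pipeline. */
  static boolean omSortsWrites(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(KEY, String.valueOf(DEFAULT)));
  }

  /**
   * Client address that OM would forward to SCM on allocateBlock.
   * An empty string means SCM skips sorting and OM sorts locally.
   */
  static String clientMachineForScm(Properties conf, String clientAddress) {
    return omSortsWrites(conf) ? "" : clientAddress;
  }
}
```

With the flag unset, the client address is forwarded unchanged and SCM keeps sorting; only an explicit `true` switches to the empty address and the OM-side sort.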
What changes were proposed in this pull request?
SCM sorts the write pipeline (nearest datanode first) on every allocateBlock, on its block-allocation hot path. OM already caches the cluster topology (HDDS-9343) and sorts reads locally, so this PR moves the write sort to OM.
OMKeyRequest.allocateBlock sends an empty clientMachine to SCM (SCM skips sorting) and sorts each pipeline locally via a new KeyManager.sortDatanodesForWrite; the result is cached per pipeline.
nodeManager.getNode did. UserInfo.remoteAddress is always an IP), so a client co-located on a datanode is recognized even with hdds.datanode.use.datanode.hostname enabled.
SCMBlockProtocolServer is unchanged for rolling-upgrade safety: an old OM still gets SCM-side sorting, a new OM's empty address is a no-op for SCM. No Protobuf/RPC change.
Note: SCM's ALLOCATE_BLOCK audit now logs client="" for OM-originated writes; the per-client audit stays at OM.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15059
How was this patch tested?
TestOMAllocateBlockRequest): SCM receives an empty clientMachine; the sorted order is applied to every block; a shared pipeline is sorted once.
TestOMSortDatanodes): nearest datanode is first for writes, including for RPC-deserialized (protobuf round-tripped) pipeline nodes; order is preserved for an empty or unresolved client; the client is matched by both IP and hostname.
build-branch CI: https://github.com/chihsuan/ozone/actions/runs/28455187435
Generated-by: Claude Code (Claude Opus 4.8)
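The behaviors covered by these tests can be illustrated with a small standalone sketch. It is not the patch's code: `WriteSortSketch`, `Node`, the two-level distance function, and the idea that one instance represents a single allocateBlock request are all assumptions made for illustration. It shows a nearest-first sort, client matching by IP or hostname, order preservation for an empty or unresolved client, and sorting a shared pipeline only once.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these types are Ozone classes. */
final class WriteSortSketch {

  /** Hypothetical stand-in for a datanode: IP, hostname, and rack. */
  static final class Node {
    final String ip;
    final String host;
    final String rack;

    Node(String ip, String host, String rack) {
      this.ip = ip;
      this.host = host;
      this.rack = rack;
    }
  }

  // Sorted result per pipeline ID, so a pipeline shared by several blocks is sorted once.
  private final Map<String, List<Node>> sortedByPipeline = new HashMap<>();
  int sortCalls = 0;

  /** Simplified distance: 0 for the same node, 2 for the same rack, 4 otherwise. */
  static int distance(Node client, Node dn) {
    if (client.ip.equals(dn.ip)) {
      return 0;
    }
    return client.rack.equals(dn.rack) ? 2 : 4;
  }

  /** Matches the client against known nodes by IP or hostname; null if empty or unknown. */
  static Node resolve(String client, List<Node> cluster) {
    if (client == null || client.isEmpty()) {
      return null;
    }
    for (Node n : cluster) {
      if (client.equals(n.ip) || client.equals(n.host)) {
        return n;
      }
    }
    return null;
  }

  /** Returns the pipeline's nodes nearest-first, reusing the cached order if present. */
  List<Node> sortForWrite(String pipelineId, List<Node> nodes, String client, List<Node> cluster) {
    return sortedByPipeline.computeIfAbsent(pipelineId, id -> {
      sortCalls++;
      Node c = resolve(client, cluster);
      List<Node> out = new ArrayList<>(nodes);
      if (c != null) {
        // List.sort is stable, so equally distant nodes keep their original order.
        out.sort(Comparator.comparingInt((Node n) -> distance(c, n)));
      }
      return out;
    });
  }
}
```

Because the cache is keyed only by pipeline ID, the sketch assumes the client is the same for every call on a given instance, which holds if the instance lives for one request.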